Q232 : Image captioning using deep learning
Thesis > Central Library of Shahrood University > Computer Engineering > MSc > 2023
Authors:
Fatemeh Esmaili [Author], Fatemeh Jafarinejad [Supervisor], Prof. Hamid Hassanpour [Advisor]
Abstract: In recent years, owing to the growth of textual and image resources and the rapid progress of artificial intelligence methods, image captioning has gradually attracted the attention of many researchers in the field. The task of image captioning, also referred to as image description, is to describe the content of an image in text. Based on the entities observed in the image, together with an understanding of the scene and the relationships between its parts, a captioning system automatically describes the image and returns this description as a grammatically well-formed sentence. This topic, which combines computer vision and natural language processing, has recently become one of the most important problems in machine vision. Various approaches have been proposed for this task, among which deep learning models have proven to be the most advanced. Deep learning-based captioning typically relies on encoder-decoder models: two components that work together as a black box to generate new captions for images. Our proposed method also follows this general principle.

The main idea of this thesis is to use information about the region of interest (ROI) to improve the quality of the generated descriptions. Specifically, we propose describing each image from two different points of view and then combining the results. The architecture is a two-stage method for image captioning. In the first stage, two different versions of each image, representing two different views of it, are fed into two captioning networks with identical configurations; each network produces a caption for the image. In the second stage, a newly introduced aggregation operator combines the information from the two networks and produces a single accurate caption that contains the important, non-redundant information of the two intermediate captions.

We evaluated the proposed method on the COCO dataset using the BLEU metric. The results show higher BLEU scores with this method, indicating that the generated captions are closer to human-written captions and that the quality and accuracy of the output have increased. We also applied our method to a Hugging Face model, one of the most recent and complex models in the field of image captioning; the method was effective on this model as well and improved its results.
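To make the two-view idea concrete, the following is a minimal sketch in Python. The specific Hugging Face model, the fixed ROI coordinates, and the word-level merge rule are illustrative assumptions only; the merge function merely stands in for the thesis's actual aggregation operator, which is not reproduced here.

```python
# A minimal sketch of the two-view captioning pipeline described above.
# Assumptions (not from the thesis): the specific Hugging Face model,
# the fixed ROI box, and the simple word-union merge rule, which only
# stands in for the thesis's new aggregation operator.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

def caption(img):
    """Run one captioning network on a single view of the image."""
    return captioner(img)[0]["generated_text"]

def aggregate(first, second):
    """Hypothetical merge: keep caption 1, append the novel words of caption 2."""
    seen = set(first.lower().split())
    extra = [w for w in second.split() if w.lower() not in seen]
    return first if not extra else first + " " + " ".join(extra)

image = Image.open("example.jpg")        # any COCO-style image
roi = image.crop((50, 50, 300, 300))     # assumed region of interest

view1 = caption(image)   # stage 1, network 1: whole image
view2 = caption(roi)     # stage 1, network 2: ROI crop
print(aggregate(view1, view2))  # stage 2: aggregation operator
```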
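For the evaluation step, a small illustration of how BLEU compares a generated caption against human reference captions, using NLTK; the sentences below are invented for illustration, not COCO data:

```python
# BLEU comparison of a generated caption against human references,
# as in the COCO evaluation mentioned above. Sentences are invented
# for illustration; NLTK's sentence_bleu expects tokenized text.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man rides a horse on the beach".split(),
    "a person riding a horse along the shore".split(),
]
candidate = "a man riding a horse on the beach".split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```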
Keywords: image captioning, deep learning, image processing, natural language processing
Keeping place: Central Library of Shahrood University